Judy Jensen, P4 Final project
## [1] 113937 83
## 'data.frame': 113937 obs. of 30 variables:
## $ LoanStatus : Ord.factor w/ 12 levels "Current"<"FinalPaymentInProgress"<..: 3 1 3 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment: num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Ord.factor w/ 8 levels "$0"<"$1-24,999"<..: 3 4 8 3 6 6 3 3 3 3 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Date, format: "2007-09-12" "2014-03-03" ...
## $ LStatus : Ord.factor w/ 3 levels "good standing"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 2007 2014 2007 2012 2013 ...
## LoanStatus BorrowerAPR ProsperScore
## Current :56576 Min. :0.00653 Min. : 1.00
## Completed :38074 1st Qu.:0.15629 1st Qu.: 4.00
## Chargedoff :11992 Median :0.20976 Median : 6.00
## Defaulted : 5018 Mean :0.21883 Mean : 5.95
## Past Due (1-15 days) : 806 3rd Qu.:0.28381 3rd Qu.: 8.00
## Past Due (31-60 days): 363 Max. :0.51229 Max. :11.00
## (Other) : 1108 NA's :25 NA's :29084
## EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
## Employed :67322 Min. : 0.00 False:56459
## Full-time :26355 1st Qu.: 26.00 True :57478
## Self-employed: 6134 Median : 67.00
## Not available: 5347 Mean : 96.07
## Other : 3806 3rd Qu.:137.00
## : 2255 Max. :755.00
## (Other) : 2718 NA's :7625
## CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines
## Min. : 19.0 : 697 Min. : 0.00
## 1st Qu.:679.0 1993-12-01 00:00:00: 185 1st Qu.: 7.00
## Median :699.0 1994-11-01 00:00:00: 178 Median :10.00
## Mean :704.6 1995-11-01 00:00:00: 168 Mean :10.32
## 3rd Qu.:739.0 1990-04-01 00:00:00: 161 3rd Qu.:13.00
## Max. :899.0 1995-03-01 00:00:00: 159 Max. :59.00
## NA's :591 (Other) :112389 NA's :7604
## OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 0.00 Min. : 2.00 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 9.00 Median : 25.00 Median : 6.00
## Mean : 9.26 Mean : 26.75 Mean : 6.97
## 3rd Qu.:12.00 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :54.00 Max. :136.00 Max. :51.00
## NA's :7604 NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit DebtToIncomeRatio
## Min. :0.000 Min. : 0 Min. : 0.000
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 0.140
## Median :0.600 Median : 4100 Median : 0.220
## Mean :0.561 Mean : 11210 Mean : 0.276
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 0.320
## Max. :5.950 Max. :646285 Max. :10.010
## NA's :7604 NA's :7544 NA's :8554
## IncomeRange StatedMonthlyIncome LoanOriginalAmount
## $25,000-49,999:32192 Min. : 0 Min. : 1000
## $50,000-74,999:31050 1st Qu.: 3200 1st Qu.: 4000
## $100,000+ :17337 Median : 4667 Median : 6500
## $75,000-99,999:16916 Mean : 5608 Mean : 8337
## Not displayed : 7741 3rd Qu.: 6825 3rd Qu.:12000
## $1-24,999 : 7274 Max. :1750003 Max. :35000
## (Other) : 1427
## LoanOriginationDate LStatus year
## Min. :2005-11-15 good standing:94855 Min. :2005
## 1st Qu.:2008-10-02 late : 2067 1st Qu.:2008
## Median :2012-06-26 default :17010 Median :2012
## Mean :2011-07-21 NA's : 5 Mean :2011
## 3rd Qu.:2013-09-18 3rd Qu.:2013
## Max. :2014-03-12 Max. :2014
##
I will primarily focus on the borrower details, loan parameters, and prosper’s calculation of creditworthiness as they relate to the price of the loan, and also on the payment status of the loan (good, late, default).
Created two new variables - Year (year loan initiated) and LStatus (consolidated LoanStatus into 3 levels - good, late, default). In addition, I created ordered factors for IncomeRange and LoanStatus so these plots would be logical.
Over 80% of Prosper Loans are complete or current (LStaus = good standing: (95k / 114k).
These distributions are for variables related to income.
These distributions are related to credit-worthiness for Prosper borrowers and were selected to reflect the variety of variables in the dataset. These variables will be tuned in the bivariate analysis.
+ Debt to income ratio is a median of 0.22
+ Credit Line, 7 years has a median of 25
+ Current credit lines is 10.
the above plots all appear to be normal distributions.
+ Revolving Monthly Payment distributions is not normal more skewed to lower payments
+ Bankcard utilization is not normal, and high with a median of 60%.
(closer look at Revolving Monthly Payments does not reveal additional interesting information)
Loan amounts have distinct bands at increments of $5k with 5k, 10k and 15k being the most popular loan amounts.
APR also appear to group at slightly below 10%, 20%, and 30% with a peak at 0.36 which is primarily from 2011 and 2012.
The range of FICO scores (CreditScoreRangeUpper) has a small peak at ~ 550, and then a peak near the median of 699. The distribution on the ProsperScore is unusual with several spikes along the peak of the distribution.
After an SEC initiated in hiatus in 2009, the number of loans is growing at a strong rate. Source: https://en.wikipedia.org/wiki/Prosper_Marketplace
The following variables have the highest correlation to the APR(price) of the loan -
positive correlation (increase APR): Delinquencies Last 7 Years, Inquiries last 6 Months
negative correlation (decrease APR): Prosper Score, Credit Score Range Upper, Available Bankcard Credit, Loan Original Amount
Annual Income does have an effect on APR, borrower APR declines ~ 5% between 25k and 125k and then is no longer affected by income.
There does not appear to be any correlation between APR and the length of employment.
This set of charts that shows a linear relationship between ProsperScore and CreditScoreUpper above 6 / 700 respectively but not at lower credit ratings.
The plots of Prosper score and Credit Score indicate why this is the case - APR remains flat with CreditScoreUpper until about 579, and doesn’t decrease noticeably until 680-700. The ProsperScore plot is more linear.
##
## Pearson's product-moment correlation
##
## data: prosper$ProsperScore and prosper$CreditScoreRangeUpper
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3637793 0.3753979
## sample estimates:
## cor
## 0.369603
## 99%
## 11
These charts provide additional information from the correlation values to APR from the ggpairs calculation.
Loan Original Amount shows two peaks in APR at loans of 3k and at 7.5k
The curves flatten eventually indicating there is an upper and lower limit to the borrower APR, (although the loans are still issued at high risk values.)
The distribution of loans at various APRs is noticably different between borrowers w/ AvailableBankcardCredit = 0 and != 0. The rates are higher for borrowers w/ no Bankcard credit. Also, the peak at 0.36 is more noticeable for borrowers with no Bankcard credit, although the peak exists on both plots.
Note difference in y-scale
These charts indicate that the price of the loan does take into consideration the associated risk based on the borrowers credit scores. The prosperscore shows a larger relative difference between Loan Status result than the CreditScoreRangeUpper.
## prosper.na$LStatus: good standing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 6.017 8.000 11.000 18410
## --------------------------------------------------------
## prosper.na$LStatus: late
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 5.064 7.000 11.000
## --------------------------------------------------------
## prosper.na$LStatus: default
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.427 7.000 11.000 10669
## prosper.na$LStatus: good standing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 719.0 712.5 739.0 899.0 416
## --------------------------------------------------------
## prosper.na$LStatus: late
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 619.0 679.0 699.0 707.1 739.0 879.0
## --------------------------------------------------------
## prosper.na$LStatus: default
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 619.0 659.0 659.8 719.0 879.0 174
## prosper.na$LStatus: good standing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15020 0.20270 0.21180 0.27250 0.51230 25
## --------------------------------------------------------
## prosper.na$LStatus: late
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06327 0.21830 0.27280 0.26740 0.32580 0.38490
## --------------------------------------------------------
## prosper.na$LStatus: default
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00864 0.18690 0.25440 0.25220 0.30910 0.50630
##
## Calls:
## m1: lm(formula = BorrowerAPR ~ ProsperScore, data = prosper.na)
## m2: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper,
## data = prosper.na)
## m3: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper +
## AvailableBankcardCredit, data = prosper.na)
## m4: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper +
## AvailableBankcardCredit + log10(LoanOriginalAmount), data = prosper.na)
## m6: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper +
## AvailableBankcardCredit + log10(LoanOriginalAmount) + InquiriesLast6Months,
## data = prosper.na)
##
## ======================================================================================
## m1 m2 m3 m4 m6
## --------------------------------------------------------------------------------------
## (Intercept) 0.360*** 0.730*** 0.701*** 0.872*** 0.858***
## (0.001) (0.003) (0.003) (0.003) (0.003)
## ProsperScore -0.022*** -0.018*** -0.018*** -0.017*** -0.016***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## CreditScoreRangeUpper -0.001*** -0.001*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000)
## AvailableBankcardCredit -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000)
## log10(LoanOriginalAmount) -0.060*** -0.059***
## (0.001) (0.001)
## InquiriesLast6Months 0.004***
## (0.000)
## --------------------------------------------------------------------------------------
## R-squared 0.447 0.537 0.540 0.591 0.596
## adj. R-squared 0.447 0.537 0.539 0.591 0.596
## sigma 0.059 0.054 0.054 0.051 0.051
## F 68477.863 49144.732 33136.421 30694.583 25035.388
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 119107.488 126645.604 126904.499 131970.791 132458.413
## Deviance 299.890 251.072 249.545 221.456 218.925
## AIC -238208.976 -253283.207 -253798.997 -263929.583 -264902.825
## BIC -238180.930 -253245.813 -253752.254 -263873.490 -264837.385
## N 84853 84853 84853 84853 84853
## ======================================================================================
##
## Call:
## lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper +
## AvailableBankcardCredit + log10(LoanOriginalAmount) + InquiriesLast6Months,
## data = prosper.na)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.196122 -0.035646 -0.005166 0.033676 0.211848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.579e-01 3.431e-03 250.03 <2e-16 ***
## ProsperScore -1.635e-02 8.470e-05 -193.02 <2e-16 ***
## CreditScoreRangeUpper -4.307e-04 4.397e-06 -97.94 <2e-16 ***
## AvailableBankcardCredit -2.034e-07 1.093e-08 -18.60 <2e-16 ***
## log10(LoanOriginalAmount) -5.877e-02 5.756e-04 -102.10 <2e-16 ***
## InquiriesLast6Months 4.138e-03 1.321e-04 31.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0508 on 84847 degrees of freedom
## (29079 observations deleted due to missingness)
## Multiple R-squared: 0.596, Adjusted R-squared: 0.596
## F-statistic: 2.504e+04 on 5 and 84847 DF, p-value: < 2.2e-16
Several variables contribute to the Borrower APR (price of the loan), including ProsperScore CreditScoreRange, AvailableBankcardCrdit, LoanOriginalAmount, BankcardUtilization, Delinquencies(7year), CurrentDelinquincies, and Inquiries(6mos).
* Bankcard Utiliazation correlated to Available Bankcard Credit (-0.37)
* Current Delinquencies correlated to Delinquencies(7 year) (0.384)
* These variables were not included in the bivariate analysis nor the model to avoid issues with collinear variables
The linear model has an R^2 factor of 0.60. this is a moderate result and reflects applying a linear model to non-linear relationships between the variables (NOTE - log10 relationships were explored for certain variables and adopted for LoanOriginalAmount)
Income shows a weak correlation to BorrowerAPR (-0.14). it did not improve the model (R), and is not included in the model.
It is assumed that Borrower APR reflects the risk of the loan. A borrower has to pay more if they are more likely to default.
* The BorrowerAPR is a good reflection of loan risk - the median borrowerAPR forloans in good standing is 20% and 25.4% for loans in default.
*The ProsperScore appears to be a better predictor of loan status than CreditScoreUpper with a larger difference of scores based on LStatus
+ ProsperScore 6 (good) to 5 (default) = 16% + CreditScoreUpper 719 (good) to 659 (default) = 8%
While the difference in mean credit scores has decreased over time for loans in good v negative standing, the mean prosper score has diverged, indicating it has become a better predictor of whether a loan will be good or negative status
This set of plots investigates the % of a loan that is paid off. The first plot demonstrates loans with >1% of the total loan still due (effectively, all unpaid loans).
* the trend for higher APRs at good standing is due to the APR rate of the loan when it was issued and the length of time the loan has been open and one should not draw conclusions on this trend.
* the trend for default loans is interesting, as these loans are closed as unpaid. It is clear that borrowers pay more of the loan when then have a lower interest rate than when they have a higher interest rate! More investigation is needed to better understand this trend, but if it hold up, it has interesting implications when pricing loans. Financially it may be more beneficial to receive back more principle and less interest.
the set of charts of distributions with completed and uncompleted loans has an expected trend of higher completed loans, and loans in good standing distributed at lower APRs, and loans with negative standing distributed at higher APRs.
This graph plots the change in APR by Loan status by year. There are two notable trends in this graph.
* APRs are consistently and noticeably lower for Loans with positive status but this gap becomes very small in 2011 based on increased APRs for loans in good standing.
* In 2011, APRs for loans tihe begative status (late or default), declined very slightly from 2010, while te APR for loans in good standing increased as noted in the previous item.
The fluctuations of the APR by year also indicate the economic conditions - although it appears it takes longer for APRs to come down that it does for them to go up.
there is a very noticeable improvement in loan status, even during the economic downturn from Q4-2007 to Q2-2009 (note that the year is inital year of the loan). 2010 is the year that prosper reset after SEC intervention, and also when they introduced their ProsperScore.
These graphs of the key variables affecting Borrower APR now indicate the status of the loan (good, late, or default). The graphs support that the APR is appropriatedly assigned to price a higher risk more than a lower risk loan since good standing loans have lower rates than defauled loans. late loans are not consistent.
* Two notable observations
+ for larger loans (greater than 4.5 (log10)), loans with negative status have a lower APR than loans with good status this is not desirable
+ as inquires last 6 months increases, the APR for default loans remains exactly the same, regardless of the number of inquires where APR increased relative to number of inquires for good standing and for late loans.
This plot graphs the APR by the year they were initiated with their loan status. and provides information on how APRs have changed over time and the result on the loan status. It is important to consider the great recession from late 2007 to mid 2009 had a significant impact on credit for the early years of this graph. There are several notable trends in this graph.
I’m astounded and stoked about the level of information in this project.
The number of variables created a huge challenge. It took considerable effort to determine the focus of this project - specifically the borrowers creditworthiness, the APR, and the relationship between APR and Loan Status.
I spent several hours trying to correlate Loan Status with various factors, but the analysis was really uninteresting. Once I determined that the APR, or price of the loan, was the more important factor, and then consolidated the LoanStatus factors into Good, Late, Default, the project went very smoothly.
The data set is so rich, and a team of analyst are certainly busy at Prosper. This analysis focuses on the price of loans (APR) and their status compared to loan parameters and the borrowers creditworthiness.
A few items were surprising -
Points for further investigation